Self-similar community structure in organisations 
The formal chart of an organisation is designed to handle routine and easily anticipated problems, but
unexpected situations arise which require the formation of new ties so that the corresponding extra tasks can be
properly accomplished. The characterisation of the structure of such informal networks behind the formal chart 
is a key element for successful management. We analyse the complex e-mail network of a real organisation with 
about 1,700 employees and determine its community structure. Our results reveal the emergence of self-similar 
properties that suggest that some universal mechanism could be the underlying driving force in the formation 
and evolution of informal networks in organisations, as happens in other self-organised complex systems. 
Although the formal chart of an organisation is intended
to prescribe how employees interact, ties between individu- 
als arise for personal, political, and cultural reasons [Jjj]. The 
understanding of the formation and structure of such informal
networks is a key element for successful management
H Bh. The traditional way of investigating informal networks
within organisations consists of conducting surveys using
employee questionnaires. However, employees' answers
often contain subjective elements such as "political" motives 
and the worry about offending colleagues. Another significant 
limitation of questionnaire-based analysis is that time and
effort costs make it prohibitively expensive to map the entire
network even for medium-sized organisations. The rapid
development of electronic communications provides a powerful
alternative to the traditional analysis of informal networks. In- 
deed, the exchange of e-mails between individuals in organ- 
isations reveals how people interact and allows mapping the 
informal network in a non-intrusive, objective, and quantita- 
tive way. 

We surmise that the exchange of e-mails between individ- 
uals in organisations reveals how people interact [Q, ||] and 
therefore provides a map of the real network structure behind 
the formal chart. We analyse the complex e-mail network of a 
real organisation with about 1,700 employees and determine 
its community structure [^, [7| |i| ^]. Our results reveal the 
emergence of self-similar properties that suggest that some 
universal mechanism could be the underlying driving force in 
the formation and evolution of informal networks in organisations,
as happens in other self-organised complex systems.

Every time that an e-mail is sent, the addresses of the sender 
and the receiver are routinely registered in a server. Therefore, 
an e-mail network can be built regarding each e-mail address 
as a node and linking two nodes if there is an e-mail com- 
munication between them. In particular, we take as a case 
study the e-mail network of University Rovira i Virgili (URV) 
in Tarragona, Spain, containing around 1700 users (Fig. 1).
Bulk e-mails provide little or no information about how indi- 
viduals or teams collaborate and, once they are removed, the 
connectivity distribution of the e-mail network is exponential, 




FIG. 1: The e-mail network of URV. The network comprises approximately
1700 users, including faculty, researchers, technicians, managers,
administrators, and graduate students. We consider e-mails
exchanged between university addresses during the first three months 
of 2002. Each individual is represented by a node, with two individ- 
uals (A and B) being connected if A has sent an e-mail to B and B 
has also sent an e-mail to A. Bulk e-mails provide little or no in- 
formation about how individuals or teams collaborate. To minimise 
their effect: (i) we eliminate e-mails that are sent to more than 50 
different recipients and (ii) we disregard links that are unidirectional, 
that is, we consider only e-mails that represent a real communication
link, where e-mails flow in both directions. With these two restric- 
tions, the network is undirected and is formed by a main component 
comprising 1133 nodes and many isolated nodes or pairs of nodes. 
These little islands are not plotted to keep the figure as simple as pos- 
sible. The colour of each node identifies an individual's affiliation to 
a specific centre within the university. 



$P(k) \propto \exp(-k/k^*)$ for $k > 2$, with $k^* = 9.2$. This result is
in contrast with recent findings indicating that some technology-based
social networks, such as rough e-mail networks
[Q], the Instant Messaging Network [11] or the PGP encryption
network [12], show heavily skewed degree distributions,
but is consistent with the proposal of Amaral and coworkers 
that the truncation of the scale-free behaviour in real world 
networks is due to the existence of limitations or costs in the 
establishment of connections. Indeed, it seems plausible
that there are costs to maintaining active social acquaintances 
and therefore active communications. However, it is relatively 
easy to keep many electronic acquaintances open, although 
FIG. 2: Community identification according to the GN algorithm. a,
The betweenness of an edge is defined as the number of minimum 
paths connecting pairs of nodes that go through that edge [El], E2] ]. 
The GN algorithm is based on the idea that the edges which connect 
highly clustered communities have a higher edge betweenness — in 
this case, edge BE — and therefore cutting these edges should sep- 
arate communities. The algorithm proceeds by identifying and re- 
moving the link with the highest betweenness in the network. After 
every removal, the betweenness of the edges is recalculated. This 
process is repeated until the 'parent' network splits, producing two 
separate 'offspring' networks. The offspring can be split further in 
the same way until they comprise only one individual. b, In order
to describe the entire splitting process, we generate a binary tree,
in which bifurcations (white nodes) depict communities and leaves 
(black nodes) represent individual addresses of the e-mail network. 
At the beginning of the process, the network is a single entity, rep- 
resented by node 1 in the tree. After the removal of the edge BE, 
the network is split into two subnetworks, 2 and 3, containing ad- 
dresses A to D and E to I respectively. The two offspring networks 
have no further internal community structure. Consider first
subnetwork 2, containing nodes A to D. When all the links are equivalent
and have the same betweenness as in the present case, one of them 
will be selected at random for removal. It is straightforward to show 
that, iterating the link removal procedure, nodes will be separated 
one by one and randomly by the GN algorithm, generating a branch 
in the binary tree. As an example, the figure represents a situation in 
which B is separated first, then A, and finally D and C, but a dif- 
ferent random selection of links would lead to a different separation 
order. Similarly, in subnetwork 3 nodes will be separated one by one 
and at random, except for the fact that the most central node, E, will 
always be separated last. In general, for large networks in which the 
probability of having two links with the same betweenness is very 
small, it will still be true that communities will appear as branches 
in the community binary tree and that the tips of the branches will 
correspond to the most central agents in the network. 



most of them are probably inactive from a social point of view. 
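
The two filtering rules stated in the caption of Fig. 1 (discard e-mails sent to more than 50 different recipients, keep a link only when e-mails flow in both directions) can be sketched as follows. This is a minimal illustration, not the authors' code: the `(sender, recipients)` log format, the function names `build_email_network` and `main_component`, and the toy addresses are all assumptions made for the example.

```python
from collections import defaultdict


def build_email_network(messages, max_recipients=50):
    """Return the set of undirected links implied by a message log.

    messages: iterable of (sender, recipients) pairs (hypothetical format).
    """
    sent = set()  # directed pairs (a, b): a sent at least one e-mail to b
    for sender, recipients in messages:
        targets = set(recipients) - {sender}
        if len(targets) > max_recipients:
            continue  # rule (i): discard bulk e-mails
        for target in targets:
            sent.add((sender, target))
    # rule (ii): keep a link only if e-mails flowed in both directions
    return {frozenset(pair) for pair in sent if (pair[1], pair[0]) in sent}


def main_component(edges):
    """Return the node set of the largest connected component."""
    adjacency = defaultdict(set)
    for edge in edges:
        a, b = edge
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    stack.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy log: one unanswered e-mail and one bulk e-mail are filtered out.
messages = [
    ("a", ["b"]), ("b", ["a", "f"]), ("f", ["b"]),
    ("a", ["c"]),                                    # never answered
    ("d", ["a"] + [f"user{i}" for i in range(60)]),  # bulk e-mail
    ("a", ["d"]),
    ("c", ["e"]), ("e", ["c"]),
]
edges = build_email_network(messages)
print(sorted(sorted(e) for e in edges))
print(sorted(main_component(edges)))
```

In the toy log, the link between "a" and "d" is dropped because the only message from "d" is a bulk e-mail, so the exchange is no longer reciprocal once rule (i) has been applied.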

To understand the structure of the informal network of the 
organisation, we are interested in determining how individuals 
interact and form groups that, in turn, interact with each other 
giving rise to higher order groups, that is, groups of groups. 
In other words, we want to unravel the community structure 
of the network. To do so we use the algorithm proposed recently
by Girvan and Newman (GN) ^ to identify communities
in complex networks (see Fig. 2). The algorithm proceeds
by splitting the network recursively until single nodes are left. 
The information about the community structure of the original 
network can be deduced from the topology of the binary tree 
that represents this splitting procedure and whose leaves correspond
to addresses of the e-mail network (Fig. 2b). The different
communities of the original network appear as branches



in this tree, which are easily identified by visual inspection. 
The community binary tree for URV is shown in Fig. 3. Each
FIG. 3: Communities in the e-mail network of URV. a, Binary tree
showing the result of applying the GN algorithm to the e-mail net- 
work of URV. The position indicated by the arrow represents the root 
of the tree (equivalent to node 1 in figure 2b) and branches are depicted
so that they can be clearly differentiated. In particular, only
the leaves of the tree, that correspond to e-mail addresses, are plot- 
ted, as shown in the detail that is zoomed. The colour of each of the 
leaves represents different centres within the university (five small 
centres containing fewer than 10 individuals are assigned the same
colour). Nodes of the same colour (from the same centre) tend to 
stick together in the same branch meaning that individuals within 
the same department tend to communicate more, and that the algo- 
rithm is capable of resolving separate centres to a good degree of ac- 
curacy. The complicated branching structure resembles self-similar 
systems in nature such as river networks or diffusion-limited aggregates.
b, Same as before but without showing the leaves. Branches
are now coloured according to their Horton-Strahler index (see text).
c, Binary tree showing the result of applying the GN algorithm to a 
random graph with the same size and connectivity as the e-mail
network. The lack of community structure is reflected in the absence 
of branches in the tree, which contrasts with the intricate self-similar 
structure of a and b. Again, colours correspond to Horton-Strahler 
indices. 

colour in Fig. 3a corresponds to one centre of the university,
that is to a faculty or college, or to management units such as 
the office of the Rector of the university. Two properties of 
the tree are worth noting. First, a clear branching structure 
emerges, with branches essentially containing nodes of the 






same colour. This shows that the identification of communi- 
ties is successful, despite the complexity of the network. Sec- 
ond, the branching structure is far from simple. Indeed, each 
branch is formed, in general, by a system of nested smaller 
subbranches that give rise to a complicated structure that vi- 
sually resembles some self-similar systems in nature such as 
river networks [13] or diffusion-limited aggregates [14]. For
comparison, we also show the tree generated by the GN al- 
gorithm from a random graph of the same size and average 
connectivity as the e-mail network (Fig. 3c). In contrast to
the tree for the URV e-mail network, the branching structure 
is almost trivial with most of the branches containing only 1 
or 2 nodes. This is the expected result for a network that does
not have any sort of community structure.
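
The recursive splitting procedure described above and in the caption of Fig. 2 can be sketched in a few dozen lines. This is an illustrative sketch under stated assumptions rather than the implementation used for the URV data: the graph is a plain dictionary of neighbour sets, the tree is returned as nested tuples (a leaf is a node label, a community is a pair of subtrees), and ties in betweenness are broken arbitrarily instead of at random. The function names and the toy graph of two triangles joined by one link are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness (Brandes-style) for an unweighted graph."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        dist, sigma, preds = {source: 0}, {source: 1.0}, {source: []}
        order, queue = [], deque([source])
        while queue:  # breadth-first search counting shortest paths
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0.0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):  # accumulate path fractions back to source
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((u, w))] += share
                delta[u] += share
    return bet  # every pair is counted from both ends; fine for ranking


def components(adj):
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    stack.append(v)
        seen |= comp
        comps.append(comp)
    return comps


def gn_tree(adj):
    """Split recursively; return a leaf label or a (left, right) tuple."""
    if len(adj) == 1:
        return next(iter(adj))
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    while True:
        comps = components(adj)
        if len(comps) > 1:
            break
        bet = edge_betweenness(adj)
        u, v = max(bet, key=bet.get)  # ties broken arbitrarily here
        adj[u].discard(v)
        adj[v].discard(u)
    left, right = comps[0], set().union(*comps[1:])
    # Recursion depth grows with tree depth; very deep trees may need a stack.
    return (gn_tree({u: adj[u] for u in left}),
            gn_tree({u: adj[u] for u in right}))


def leaves(tree):
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


# Two triangles joined by the single link C-D (hypothetical example).
graph = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
tree = gn_tree(graph)
print(sorted(leaves(tree[0])), sorted(leaves(tree[1])))
```

In this toy graph the link C-D lies on every shortest path between the two triangles, so it is removed first and the root of the tree separates {A, B, C} from {D, E, F}. Recomputing the betweenness after every removal makes the sketch slow for large networks; it is meant only to make the procedure concrete.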

Once the binary tree has been obtained, we look for a quan- 
titative characterisation of the community structure. First we 
consider the cumulative community size distribution, P(s), 
i.e. the probability of a community having a size larger than or
equal to s. Fig. 4a shows how to compute this probability, and
the resulting distribution for the e-mail network is presented
in Fig. 4d. The distribution is heavily skewed, following a
power law behaviour $P(s) \propto s^{-\alpha}$ with $\alpha = 0.48$ between
$s = 2$ and $s \approx 100$. Beyond this value, the distribution shows
a sharp decay and, at $s \approx 1000$, a cutoff that corresponds to
the size of the system. The power law of the community size
distribution suggests that there is no characteristic community
size in the network (up to $s \approx 100$). To rule out the possibility
that this behaviour is due to the community identification al- 
gorithm, we also consider the community size distribution for 
a random graph with the same size and average connectivity 
as the e-mail network. 
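
As a concrete illustration of the cumulative community size distribution, the sketch below computes the size of every community in a small binary tree and the fraction of communities with size larger than or equal to s. The nested-tuple tree format and the function names are assumptions made for this example. The toy tree is hypothetical: it was chosen to reproduce the counts quoted in the caption of Fig. 4 (three communities of size 2, three of size 3, and one each of sizes 6, 7 and 10), and the leaves G to J are invented labels.

```python
from collections import Counter


def community_sizes(tree):
    """Return (number of leaves, list with the size of every community)."""
    if not isinstance(tree, tuple):
        return 1, []  # a leaf is an individual, not a community
    left_size, left_list = community_sizes(tree[0])
    right_size, right_list = community_sizes(tree[1])
    size = left_size + right_size  # size is the sum of the two offspring
    return size, left_list + right_list + [size]


def cumulative_distribution(sizes):
    """Map each observed size s to the fraction of communities with size >= s."""
    counts = Counter(sizes)
    remaining, total = len(sizes), len(sizes)
    result = {}
    for s in sorted(counts):
        result[s] = remaining / total
        remaining -= counts[s]
    return result


# Hypothetical tree: leaves are strings, communities are pairs of subtrees.
six = ((("A", "B"), "E"), (("C", "D"), "F"))
seven = (six, "G")
three = (("H", "I"), "J")
tree = (seven, three)

total, sizes = community_sizes(tree)
P = cumulative_distribution(sizes)
print(total, sorted(sizes))
print(P)
```

Every internal node of the tree counts as one community, so a single individual contributes to several communities at different levels, as noted in the caption of Fig. 4.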

The characterisation of the community binary tree using the 
cumulative size distribution has its analogy in the river net- 
work literature [ pj[ |l5| , [l6[ ], The equivalent measure is the 
distribution of drainage areas, which represents the amount of
water that is generated upstream of a given point (see Fig. 4b).
The similitude between the community size distribution
of the current e-mail network in Fig. 4d and the area distribution
of the Fella river network in Italy reported in Fig. 2 of
Ref. [16] is striking. The exponent $\alpha = 0.45$ for the power
law region of this river and the average exponent for several
rivers, $\alpha_{\mathrm{river}} = 0.43 \pm 0.03$, reported by [16] and
[15] respectively, are very close to the current $\alpha = 0.48$. Moreover, the
behaviour shown in Fig. 4d, with first a sharp decay and then a
final cutoff, is also shared by river networks, which are known
to evolve to a state where the total energy expenditure is minimised
[15, 17, 18]. The possibility that communities within
organisations might also spontaneously self-organise into a 
form in which some quantity is optimised is very appealing 
and deserves further investigation. 

To further understand this point, it is pertinent to ask the 
question: does the similarity between community trees in or- 
ganisations and river networks arise just by chance or are 
there other emergent properties shared by both? To answer 
this question we consider a standard measure for categorising 
binary trees: the Horton-Strahler (HS) index, originally intro- 
FIG. 4: Self-similarity in the community structure. a, Calculation
of the community size distribution for a binary tree generated by the 
community identification algorithm. Black nodes represent the ac- 
tual nodes of the original graph while white nodes are just graphical 
representations of communities that arise as a result of the splitting 
procedure. Nodes A and B belong to a community of size 2, and 
together with E form a community of size 3. Similarly, C, D and F 
form another community of size 3. These two groups together form a 
higher level community of size 6. Following up to higher and higher 
levels, the community structure can be regarded as the set of nested 
groups. The size, $s_i$, of a community $i$ is just the sum of the
sizes of its two offspring $j_1$ and $j_2$: $s_i = s_{j_1} + s_{j_2}$. In this case there
are three communities of size 2, three communities of size 3, one 
community of size 6, one community of size 7, and one community 
of size 10. Note that a single node belongs to different communities 
at different levels. b, Calculation of the drainage area distribution
for a river network. The drainage area of a given point is the number
of nodes upstream of it plus one. For a point $i$ with offspring $j_1$
and $j_2$, $s_i = s_{j_1} + s_{j_2} + 1$. c, Calculation of the Horton-Strahler
index. The index of a branch changes when it meets a branch with 
higher index, or when it meets a branch with the same value and both 
of them join forming a branch with higher index. In this case, there 
are 10 branches with index 1, 3 branches with index 2, and 1 branch 
with index 3. d, The distribution of community sizes, P(s), show- 
ing a power law region with the exponent -0.48, followed by a sharp 
decrease at s ~ 100 and a cutoff corresponding to the size of the sys- 
tem at s ~ 1000. The distribution of community sizes in a random 
network is shown with a dotted line for comparison. e, The number
of branches with HS index i, as a function of i. From the definition of 
the branching ratio, it is straightforward to show that, when topological
self-similarity holds, $N_i = N_1/B^{i-1}$. A fitting of this function
to the points obtained for the e-mail community tree yields excellent
agreement with $B = 5.76$. A much worse agreement is obtained for
the community tree corresponding to the random network, with $B_i$
fluctuating around 3.46.



duced for the study of river networks by Horton [19], and later
refined by Strahler [20]. Consider the binary tree depicted in
Fig. 4c. The leaves of the tree are assigned a HS index $i = 1$.
For any other branch that ramifies into two branches with HS
indices $i_1$ and $i_2$, the index is calculated as follows:

$$ i = \begin{cases} i_1 + 1 & \text{if } i_1 = i_2, \\ \max(i_1, i_2) & \text{if } i_1 \neq i_2. \end{cases} $$

Note that the index of a branch changes when it meets a branch 
with higher index, or when it meets a branch with the same 
value and both of them join forming a branch with higher in- 
dex. In terms of communities, the interpretation of the HS 
index is the following. The index of a community changes 
when it joins a community of the same index. Consider, for 
instance, the lowest levels: individuals (i = 1) join to form 
a group (or team, with i = 2), which in turn will join other 
groups to form a second level group (or department, i = 3). 
Therefore, the index reflects the level of aggregation of communities.
The number of branches $N_i$ with index $i$ can be
determined once the HS index of each branch is known. The
bifurcation ratios $B_i$ are then defined by $B_i = N_i/N_{i+1}$ (by
definition $B_i \geq 2$).
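
The Horton-Strahler rule and the bifurcation ratios can be computed with a short recursion over the binary tree. The sketch below is an illustration under stated assumptions: the tree is given as nested tuples with string leaves, a new branch of index i + 1 is counted each time two branches of equal index i join, and a branch that meets one of lower index simply continues. The function names and the ten-leaf toy tree are hypothetical.

```python
from collections import Counter


def horton_strahler(tree, counts):
    """Return the HS index of the tree root and tally branches per index."""
    if not isinstance(tree, tuple):
        counts[1] += 1  # every leaf starts a branch of index 1
        return 1
    i1 = horton_strahler(tree[0], counts)
    i2 = horton_strahler(tree[1], counts)
    if i1 == i2:
        counts[i1 + 1] += 1  # two equal branches join into a new, higher one
        return i1 + 1
    return max(i1, i2)  # the higher-index branch continues unchanged


def bifurcation_ratios(counts):
    """B_i = N_i / N_(i+1) for every index i that has a successor."""
    return {i: counts[i] / counts[i + 1]
            for i in sorted(counts) if counts.get(i + 1)}


# Hypothetical tree: leaves are strings, communities are pairs of subtrees.
six = ((("A", "B"), "E"), (("C", "D"), "F"))
tree = ((six, "G"), (("H", "I"), "J"))

counts = Counter()
root_index = horton_strahler(tree, counts)
ratios = bifurcation_ratios(counts)
print(root_index, dict(counts))
print(ratios)
```

For this toy tree the tally is 10 branches of index 1, 3 of index 2 and 1 of index 3, which gives bifurcation ratios of 10/3 and 3.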

When $B_i \approx B$ for all $i$, the structure is said to be topologically
self-similar, because the overall tree can be viewed as
being comprised of $B$ sub-trees, which in turn are comprised
of $B$ smaller sub-trees with similar structures and so forth for
all scales. River networks are found to be topologically
self-similar with $3 < B < 5$ [14]. We find that the community tree
as generated by the process described above is topologically
self-similar with $B_i \approx B = 5.76$ (see Fig. 4). The same
analysis for the communities in a random graph shows that
topological self-similarity does not hold, since the values $B_i$
are not constant; they fluctuate around a smaller value of 3.46.
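
The link between a constant bifurcation ratio and the number of branches per index can be written out explicitly. The short derivation below follows only from the definition of the bifurcation ratio and the assumption that it takes the same value $B$ at every level; $N_1$ denotes the number of branches of index 1, that is, the number of leaves.

```latex
\begin{align}
  \frac{N_i}{N_{i+1}} \approx B
    \quad &\Longrightarrow \quad N_{i+1} \approx \frac{N_i}{B}, \\
  N_i &\approx \frac{N_1}{B^{\,i-1}}, \\
  \log N_i &\approx \log N_1 - (i-1)\,\log B .
\end{align}
```

On a semi-logarithmic plot of $N_i$ against $i$, topological self-similarity therefore appears as a straight line of slope $-\log B$, whereas fluctuating ratios $B_i$ show up as deviations from that line.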

The methods presented here open interesting doors regard- 
ing the possibility of mapping the informal network of large 
organisations in a non-intrusive, objective, and quantitative 
way. Moreover, the emergence of scaling and self-similarity 
in the community structure, as well as the similarity with river 
networks, raises important questions about the mechanisms 
underlying the interactions between individuals within an or- 
ganisation. Self-similarity is a fingerprint of the replication 
of the structure at different levels of organisation, and could 
be the result of the trade-off between the need for cooperation 
and the physical constraints on establishing connections at any
organisational level. At the same time, the similitude with river
networks suggests that a common principle of optimisation — 
of flow of information in organisations or of flow of water in 
rivers — could be the underlying driving force in the formation 
and evolution of informal networks in organisations. 